Abstract

This paper presents a novel software driven
voltage tuning method that utilises multi-purpose
Ring Oscillators (ROs) to provide process variation
and environment sensitive energy reductions.
The proposed technique enables voltage
tuning based on the observed frequency of the
ROs, taken as a representation of the device speed
and used to estimate a safe minimum operating
voltage at a given core frequency. A conservative
linear relationship between RO frequency and
silicon speed is used to approximate the critical
path of the processor.

Using a multi-purpose RO not specifically implemented
for critical path characterisation is a
unique approach to voltage tuning. The parameters
governing the relationship between RO and
silicon speed are obtained through the testing of
a sample of processors from different wafer regions.
These parameters can then be used on all
devices of that model. The tuning method and
software control framework are demonstrated on a
sample of XMOS XS1-U8A-64 embedded microprocessors,
yielding a dynamic power saving of
up to 25% with no performance reduction and no
negative impact on the real-time constraints of
the embedded software running on the processor.

1 Introduction 

Modern embedded computing systems require
ever-increasing performance from microprocessors
whilst simultaneously consuming less energy.
Progress in both of these areas leads to new
opportunities in embedded applications.

Advances in silicon fabrication technologies
bring shrinks in feature size, which, along with
increased transistor counts, help to reduce power
by requiring a lower operating voltage than the
previous generation of devices. However, the
relationship between feature size, power and
performance becomes more complex and subject to
greater variability with smaller process technologies,
and so feature size alone cannot be relied
upon to improve both performance and power.

Techniques such as Dynamic Voltage and Frequency
Scaling (DVFS) [BPSB00], power- and
clock-gating, and advanced sleep states [ANDS0^]
are used in combination with operating systems
and application software to ensure energy
consumption is minimised whilst delivering sufficient
performance for a set of tasks.

In order for a processor manufacturer to ship a
product and guarantee reliability, it must select
operating parameters that will enable the product
to function correctly in spite of any process variation
across the range of parts. As such, the operating
voltages in the product data sheet must be
chosen with a degree of conservatism. Additionally,
products may be binned into different speed
or power categories, depending on their behaviour
under test, but there will still be variation, even
across the binned parts, which presents challenges
in categorising parts reliably [SPKG10].

Ring Oscillators (ROs) are free-running logic
blocks that operate at a frequency governed by
the delays intrinsic to their structure and also the
behaviour of the silicon within which they are
implemented. Therefore, their behaviour varies
from chip to chip. This can be exploited in various
ways, for example as a random number seed, or
for clock generation. In this paper, ROs combined
with hardware counters are used to determine the
most appropriate operating voltage for a processor
given the observed RO speed and the desired
operating frequency.

The XMOS XS1-U8A-64 processor is used as
the test subject for this technique, by virtue of
its voltage scaling capabilities and RO implementation.
The ROs are embedded in the processor
silicon, but are separate components within the
processor accessible via software and so can be
considered multi-purpose.

This paper makes the following contributions
to the areas of energy efficient embedded
software/hardware co-design and embedded processor
architectures:

• A unique application of multi-purpose 
software-controlled ROs, rather than custom 
hardware blocks. 

• A flexible soft control loop for voltage tuning 
a system, that is both process variation and 
temperature sensitive. 

• The control method, although implemented 
in software, has zero impact on the timing of 
an application running on the processor. 

• Granularity of control is unconstrained in 
software as characterisation formulae are 
used rather than a table of operating states. 

• The method’s power saving capabilities are
demonstrated, through testing and evaluation
on a set of samples of the target device.

The rest of this paper is organised as follows.
Section 2 explores frequency and voltage scaling,
existing voltage tuning approaches including
methods for evaluating the silicon speed, and uses
of ROs in processors. In Section 3 a new tuning
approach is described in the context of the chosen
target hardware. Section 4 shows the results
of testing this technique on a sample of target
processors. An evaluation of the technique’s
effectiveness is presented in Section 5, followed by
discussion of future work in Section 6 and
conclusions in Section 7.

For clarity, this paper refers to energy and
power in terms of consumption and dissipation
respectively. That is, energy consumption is a
measure of total work done: the amount of potential
that is transformed in order to achieve the
desired outcome, typically measured in Joules.
Power dissipation is an instantaneous measure
of a rate of energy transfer, expressed in Watts.
Power dissipation at 1 Watt for 1 second results in
an energy consumption of 1 Joule. The majority
of this paper refers to power, rather than energy,
for consistency.


2 Background 

This paper builds upon research into and application
of techniques in the areas of CMOS device
properties, voltage and frequency scaling, and
ROs. This section covers the relevant background
within these three areas.

2.1 Power dissipation and DVFS 

The technique of Dynamic Voltage and Frequency
Scaling (DVFS) is motivated by a desire to minimise
energy consumption in a device by operating
in the most efficient possible trade-off of power
vs. performance for a given workload [BPSB00].
DVFS is affected mainly by two components of
power dissipation in a CMOS device: static and
dynamic power.

Static power 

The main component of static power is the leakage
current of the transistors in the silicon. This is
present regardless of the on/off state of transistors.
As processors are fabricated on smaller process
nodes, the percentage of overall power dissipation
that is attributed to leakage is growing [KABM03],
for example due to increased leakage through
thinner gate oxide layers, which must be combated
with technology such as improved high-k gate
dielectrics [WWAni].

$P_s = V I_{leak}$    (1)

In Equation 1, the static power, $P_s$, is the product
of the device voltage, $V$, and the leakage
current, $I_{leak}$. Thus, there is a simple linear
relationship between operating voltage and static
power.

Dynamic power

Power dissipated in order to switch transistors
on or off is termed dynamic power, $P_d$, and is
expressed in Equation 2.

$P_d = \alpha C_{sw} V^2 F$    (2)

$C_{sw}$ is the capacitance of the transistors in the
device and $\alpha$ is an activity factor, or the proportion
of them that are switched. Activity factor is
workload specific, but often estimated as switching
half of the transistors in the device [BTM00],
giving $\alpha = 0.5$. $F$ is the operating frequency of
the device. Observe that changes in $V$ have the
biggest influence on dynamic power dissipation.
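The voltage dependence of the two power components can be illustrated with a minimal Python sketch. It assumes that only the supply voltage changes, so that dynamic power scales with the square of the voltage and static power scales linearly, while activity factor, capacitance, frequency and leakage current are held fixed. The function name and the example voltages are illustrative and not taken from the paper.

```python
def relative_power(v_new, v_nominal=1.0):
    """Return (dynamic, static) power as fractions of their nominal values.

    Only the voltage terms are varied: dynamic power scales with V^2 and
    static power with V. Activity factor, capacitance, frequency and
    leakage current are assumed unchanged, which is a simplification.
    """
    if v_new <= 0 or v_nominal <= 0:
        raise ValueError("voltages must be positive")
    ratio = v_new / v_nominal
    return ratio ** 2, ratio


# Illustrative operating point: 0.9 V against a 1.0 V nominal supply.
dyn, stat = relative_power(0.9)
print(f"dynamic: {dyn:.2f} of nominal, static: {stat:.2f} of nominal")
```

Under these assumptions a 10% voltage reduction removes roughly 19% of dynamic power but only 10% of static power, which is why the voltage term dominates the savings.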

A reduction in $V$, however, will slow the transistor
switching speed, increasing the delay in the
critical path, requiring that $F$ also be lowered.
Thus, there is a trade-off between reduced power
dissipation and the total energy consumption due
to longer execution time; in some cases it is not
beneficial to slow the device down further. Choosing
a strategy for energy saving, be it tuning the
frequency to avoid slack time, or racing to idle
by operating at high speed briefly, then reducing
to a low power state, is dependent on the type of
work and the behaviour of the system; there is
not one strategy that works in all cases [ANC08].

The relationship between voltage and frequency
varies depending on manufacturing process and
device implementation. Simplistic representations,
such as that in [KABM03], relate $F$ to the
voltage headroom $V - V_{th}$, where $V_{th}$ is the
threshold voltage of the transistor. As $V$ approaches
$V_{th}$, $F$ approaches zero. The nominal operating
frequency and voltage, $F_{norm}$ and $V_{norm}$
respectively, can therefore be represented as Equation 3,
taken from [KABM03], where $V_{max}$ is the maximum
operating voltage of the transistor.

$V_{norm} = F_{norm} \left(1 - \frac{V_{th}}{V_{max}}\right) + \frac{V_{th}}{V_{max}}$    (3)

A step reduction in frequency yields a smaller
step reduction in voltage. With a conservative
view, where preserving correct operation is required,
the relationship can be represented linearly.
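As a worked illustration of this linear view, the sketch below assumes the normalised form V_norm = F_norm (1 - V_th/V_max) + V_th/V_max. The threshold and maximum voltages used are hypothetical values chosen only to show the shape of the relationship, not parameters of any device discussed here.

```python
def normalised_voltage(f_norm, v_th, v_max):
    """Linear voltage/frequency model in normalised units.

    f_norm is the frequency as a fraction of the maximum (0..1).
    v_th and v_max are the threshold and maximum voltages; the values
    passed below are hypothetical and only illustrate the slope.
    """
    if not 0.0 <= f_norm <= 1.0:
        raise ValueError("f_norm must lie in [0, 1]")
    if not 0.0 < v_th < v_max:
        raise ValueError("require 0 < v_th < v_max")
    beta = v_th / v_max
    return f_norm * (1.0 - beta) + beta


# Halving the frequency does not halve the voltage, because the
# threshold voltage sets a floor on the usable supply.
full = normalised_voltage(1.0, v_th=0.3, v_max=1.0)
half = normalised_voltage(0.5, v_th=0.3, v_max=1.0)
print(full, half)
```

With these assumed values the slope is 0.7, so a 50% step down in frequency corresponds to a 35% step down in normalised voltage.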


Figure 1: Basic construction of a ring oscillator,
with feedback of output Q to the first
inverter in the chain. Wire lengths and
the number of inverters affect the
oscillator frequency.

Other losses 

Conditions such as short-circuit current can also
be factored into the overall power dissipation of
a device. Techniques such as the α-power law
MOS model consider these [Sak88]. In this paper,
however, these additional effects are considered to
be part of either dynamic or static power, depending
on their relationship to transistor switching
activity.

2.2 Ring Oscillators 

An RO is typically implemented as a series of
connected inverters, with the final inverter’s output
looped back to the input of the first. Provided
an odd number of inverters are used, the
circuit will be astable and the output will switch
states continuously at a frequency governed by
the propagation delays in the inverters and their
connecting wires.

The simplest model for the frequency of an
oscillator is determined by the number of inverters
that form it [MS10]. Equation 4 expresses a RO’s
frequency, $F_o$, in terms of the number of inverters, $N$,
and the propagation delay of each inverter, $T_{inv}$,
where 2 inversions produce a cycle.

$F_o = \frac{1}{2 N T_{inv}}$    (4)

The delay term, $T_{inv}$, is dependent on multiple
factors, including manufacturing process, transistor
size, operating voltage and device temperature.
This makes a RO unsuitable on its own as a stable
clock source, but creates various other possible
applications.
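The inverter-chain model can be sketched numerically. The code below assumes the standard relation in which the oscillation period is twice the total propagation delay around the loop, i.e. F_o = 1 / (2 N t_inv). The inverter count and delay values are hypothetical and do not describe the ROs in any particular device.

```python
def ro_frequency_hz(n_inverters, t_inv_s):
    """Ideal ring oscillator frequency for N inverters of equal delay.

    Assumes the signal must travel around the loop twice per cycle,
    giving a period of 2 * N * t_inv. Wire delay is ignored.
    """
    if n_inverters < 1 or n_inverters % 2 == 0:
        raise ValueError("an odd number of inverters is required")
    if t_inv_s <= 0:
        raise ValueError("inverter delay must be positive")
    return 1.0 / (2 * n_inverters * t_inv_s)


# Hypothetical example: 5 inverters with 100 ps delay each.
nominal = ro_frequency_hz(5, 100e-12)
# A 10% increase in delay (e.g. lower voltage) slows the oscillator.
slowed = ro_frequency_hz(5, 110e-12)
print(nominal, slowed)
```

Because the delay term moves with process, voltage and temperature, the same structure reports a different frequency on each chip and operating point, which is the property the tuning method relies on.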

ROs can be used for a wide range of purposes,
including as an entropy source for hardware random
number generation [XMO10], as voltage controlled
oscillators within PLLs (Phase Locked
Loops) [WK(f94], or as part of control circuitry
for voltage-sensitive components [BPSB00].

2.3 Frequency and voltage selection,
critical path estimation

In a typical use of DVFS in general purpose computing,
the operating system will instruct a processor
to change its power state depending on
its workload [BG13]. This will trigger a change
in frequency and/or voltage, balancing performance
with energy consumption. The voltage
and frequency points are typically selected by the
processor manufacturer and must ensure valid
operation for all processors of a given model. As
such, they must be sufficiently conservative to
account for process variation from manufacturing
the processors, where the location of the die on
the wafer may affect its speed. The challenge lies
in monitoring or correctly modelling the critical
path or paths in the processor. In embedded and
deeply embedded systems, DVFS may be applied
using different constraints, or without the assistance
of an operating system, but shares the same
power-saving goals.

Various hardware-assisted approaches for voltage
tuning exist. In-situ error detectors can be
placed into a processor design [DRL06]. These
detectors can identify when the voltage is too low
(or the frequency is too high), and then appropriate
action can be taken to correct the timing
issue and re-execute any failed instructions.

A delay line can be used to characterise the
critical path. In [INS+12], a Universal Delay Line
(UDL) is introduced, which aims to be portable
across designs by containing a gate structure that
minimises delay error and thus acts as a reliable
input for voltage control. Multiple UDLs are used
to account for within-die variation. The reported
results demonstrate that voltage tuning using this
method achieves a 27% active power reduction.

A similar approach has been used in Field Programmable
Gate Arrays (FPGAs), in which a
delay line was implemented that was timed in
order to establish whether the FPGA fabric was
operating quickly enough, or needed additional
voltage [NNY12]. This allows the FPGA to provide
a reconfigurable hardware control module to
a system, with tightly tuned voltage and frequency
scaling capabilities. Further research embeds in-situ
detectors into arbitrary IP blocks targeting
an FPGA, to achieve a similar goal to the delay
line approach, but more closely integrated
with the target IP [NY13]. The DVFS control
is tightly connected with the in-situ error detection
circuitry to ensure that changing operating
conditions do not lead to unrecoverable errors.

Critical paths can be estimated via other methods,
such as [LS10], in which the multiple possible
critical paths of a complex processor, combined
with the variations introduced from modern silicon
process technologies, are used to create a
representative model. This reflects the worst case
delay of the circuit and is shown to have an average
error margin of less than 2.8%, with a lower
level of pessimism required than other estimation
methods of capturing the critical path delay.

Ring Oscillators can be used as part of a control 
loop in a DVFS implementation [BPSB00]. The
RO can be used for directing a voltage controller 
when frequency changes are requested. Changes 
in RO speed are taken to be a simplified analogue 
of changes in the critical path of the hardware, 
forming part of the feedback loop to the frequency 
selector, which adjusts the supply voltage until 
the target frequency is reached. 

Other approaches, such as that of [LS00], instead
implement a selection of frequencies and
voltages in a table, controlled by software, which
can schedule changes to the frequency and voltage
based on the worst-case execution time of a set
of tasks that form a workload.

In power-gated circuits, the gate sizing can be
exploited as a method of adaptive power control.
In [HH11], a network of power gates is selectively
enabled. A smaller number of enabled gates limits
the voltage supplied to the connected logic.
Device activity is monitored by measuring supply
voltage, where a period of switching activity will
result in a dip in voltage, followed by a return
to previous levels. Thus the loading of the circuit
can be observed and the voltage during slack periods
optimised, resulting in a 12% average power
reduction.

Comparison 

The key differences between our contribution and
prior work are, firstly, that the proposed implementation
utilises an existing hardware block that
is designed for multiple functions, not specifically
as a critical path model or as part of a hardware
control loop. Secondly, the proposed voltage
tuning approach forms a hardware/software
control loop, in which the voltage selection decisions,
as well as safety margins, are implemented
in software. Further, the control algorithm uses
characterisation formulae, rather than table lookups,
to provide a target voltage, and thus imposes
no software restriction on the number of possible
voltage/frequency selections.

These differences provide greater flexibility and
potential portability to other systems than related
work. However, the latency increase incurred from
implementing the control system in software limits
the ability to save power over fine-grained time
intervals, and the simpler hardware block used to
represent the critical path necessitates a more
conservative safety margin. Possible improvements
to these areas are discussed in Section 6.

3 Implementation 

This section describes the selected processor family
for use in experimentation, along with the
software technique used to apply RO-based voltage
tuning to the devices.

The following requirements are key to the ability
to apply the proposed voltage tuning technique:

• A configurable power supply, with sufficiently
fine-grained control to allow changes to the
device’s core supply without necessarily needing
to change frequency.

• Configurable frequency, at run or boot time,
and ideally dynamically.

• Internal ROs, attached to hardware counters, 
to provide assessment of the device’s speed. 

The XMOS XS1-U8A-64 processor was selected
based on these criteria. Other processors, particularly
soft-cores for FPGAs, could also be used,
with some modification to include ROs that can
be sampled, using similar methods to the delay
line or in-situ detectors described in [NNY12]
and [NY13]. However, the XMOS processor has
all the required capabilities readily available.

3.1 Test device: XMOS XS1-U8A-64 
processor 

The XS1-U8A-64 combines a hardware multi-threaded
XS1 processor with a set of peripherals
that provide a USB PHY, various analogue components
such as ADCs, and configurable power
supplies.


The XS1 multi-threaded architecture has I/O
and peripheral component control built into the
instruction set, rather than memory-mapped. The
architecture is described in more detail in [May09]
and [MDO+08]. It is used to implement flexible
hardware interfaces in software using a C-like
language, with very low latency (as little as 10 ns)
between pin activity and software response. The
predictable timing of the architecture makes it
well suited to hard real time embedded software.
Of particular interest to this experiment, each
core has four ROs within it, with two distinct
implementations and two locations in the design.
One of each RO implementation is placed near
the I/O ports of the device, and the other two
are located near the processor core.

These ROs act as clock sources for a set of 16-bit
hardware counters, which can be selectively
enabled/disabled and the counter values read with
a simple sequence of instructions [XMO10]. Thus,
by enabling a RO’s counter for a specified period,
the speed of the RO can be determined. Even
without detailed knowledge of the RO implementation,
its speed can be compared to other chips
of the same series, assuming a consistent reference
clock for timing.
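Converting two counter readings into a frequency estimate can be sketched as follows. This is a minimal Python model, not XS1 code: it assumes a free-running 16-bit counter that wraps at most once during the measurement window, and the function names and sample readings are hypothetical.

```python
COUNTER_BITS = 16
COUNTER_MODULUS = 1 << COUNTER_BITS  # 65536 for a 16-bit counter


def counter_delta(start, end):
    """Ticks elapsed between two readings of a wrapping 16-bit counter.

    Modular subtraction gives the right answer if the counter wrapped
    no more than once between the two readings.
    """
    return (end - start) % COUNTER_MODULUS


def estimate_ro_hz(start, end, window_s):
    """Estimate RO frequency from counter readings over a timed window."""
    if window_s <= 0:
        raise ValueError("window must be positive")
    return counter_delta(start, end) / window_s


# Hypothetical readings: the counter wrapped from 65000 round to 1464,
# i.e. 2000 ticks, during a 10 microsecond window.
print(counter_delta(65000, 1464))
print(estimate_ro_hz(65000, 1464, 10e-6))
```

The estimate is only as good as the reference used to time the window, which is why a consistent reference clock is needed when comparing chips.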

Peripheral components of the XS1-U8A-64 are
presented as endpoints on an XMOS device network,
accessible via the channel communications
paradigms established in the XS1 instruction set
architecture [May09]. They are configurable in
a similar way to I2C or SPI peripherals, but at
the physical level and low-level in software, the
interface is somewhat different.

Three power supplies are provided in the peripheral
part of the XS1-U8A-64, one 3.3 V for
I/O logic and two 1 V, for separate Phase-Locked
Loop (PLL) and core supplies. For this research,
the core supply is the only one that is adjusted.
This particular supply can be configured between
0.6 V and 1.3 V in 10 mV steps, with a recommended
slew rate of 10 mV per microsecond to
limit over- and under-shoot.

Assuming a safe set of default conditions for
both the power supply and core frequency, both
can be changed dynamically during program execution.
The voltage should be changed no faster
than the aforementioned slew rate, whilst the
core frequency can either be divided to a lower
frequency on the fly, or the PLL can be reprogrammed
to a new target frequency [MDO+08],
triggering a soft-reset and reboot of the core.
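A voltage transition that respects the supply limits quoted above (0.6 V to 1.3 V, 10 mV steps, 10 mV per microsecond) can be planned as a sequence of timed setpoints. The sketch below is a host-side Python model under those stated figures; the function name and the idea of precomputing a schedule are assumptions for illustration, not the framework's actual interface.

```python
V_MIN_MV = 600        # lowest configurable core voltage, in millivolts
V_MAX_MV = 1300       # highest configurable core voltage
STEP_MV = 10          # supply resolution
SLEW_MV_PER_US = 10   # recommended maximum slew rate


def ramp_schedule(current_mv, target_mv):
    """Return [(time_us, voltage_mv), ...] for a slew-limited ramp.

    One supply step is issued per STEP_MV / SLEW_MV_PER_US microseconds,
    so the average rate of change never exceeds the recommended slew.
    """
    for v in (current_mv, target_mv):
        if not V_MIN_MV <= v <= V_MAX_MV:
            raise ValueError(f"{v} mV is outside the supply range")
        if v % STEP_MV != 0:
            raise ValueError(f"{v} mV is not a multiple of {STEP_MV} mV")
    direction = 1 if target_mv > current_mv else -1
    step_time_us = STEP_MV / SLEW_MV_PER_US
    schedule = []
    voltage, elapsed = current_mv, 0.0
    while voltage != target_mv:
        voltage += direction * STEP_MV
        elapsed += step_time_us
        schedule.append((elapsed, voltage))
    return schedule


# Example: lowering the core supply from 1.00 V to 0.93 V.
for t_us, mv in ramp_schedule(1000, 930):
    print(f"t = {t_us:.0f} us -> {mv} mV")
```

Working in integer millivolts avoids floating-point drift when stepping, and a 70 mV change completes in seven steps.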

3.2 Software requirements and hardware 
considerations 

A software implementation of self tuning voltages
and frequencies must consider the behaviour
and capabilities of the underlying hardware whilst
giving certain assurances to the application software
that will be running upon it. An embedded
environment with hard real-time constraints is
considered. As such, a number of criteria must
be given attention.

Environment and workload affect silicon speed 

Transistor switching speed increases in an approximately
linear relationship to voltage whereas rising
temperature can either increase or decrease
speed, depending on the feature size [KK06]. However,
higher voltages result in greater dynamic
and static power dissipation, and so the relationships
between design thresholds, workload, speed,
voltage and temperature are not always straightforward.
For example, the relationship between
temperature and threshold voltage can typically
be represented linearly, but the static current
leakage has an exponential relationship with
temperature [WA12].

Processor temperature may be influenced by 
the ambient temperature of the operating envi¬ 
ronment, but also by the workload run upon it, 
as this will increase energy consumption and thus 
power dissipated as heat. 

In order to provide a reasonable expectation of
safety in a voltage tuned chip, its speed should
either be constantly monitored, or if this is not
possible, it should be measured at an appropriate
limit of its operating temperature in the given
environment. In the latter case, an environmental
change may lead to a fault or sub-optimal energy
usage.

Performance cannot be impacted by voltage 
tuning 

If a given application is analysed and proven to
work at a particular operating frequency, then
the introduction of voltage tuning should not adversely
affect that. This constrains the tuning
to finding the lowest voltage for the currently
assigned frequency. Other strategies may be acceptable
in other workloads, such as finding a
suitable frequency for a given voltage, in an environment
where the nominal voltage may not be
achievable. However, for this paper, the focus is
upon tuning the voltage to the current frequency.

Latency and deadlines cannot be adversely 
affected 

A fine-grained performance requirement in an
embedded system is that response times to certain
events must be kept low in order for hard
deadlines to be met. As such, the process of monitoring
the silicon speed or changing the voltage
must not cause deadlines to be missed. The simplest
method for guaranteeing this is to avoid any
activity that would affect timing in any way, such
as inserting additional tasks into the workload, or
modifying existing tasks.

3.3 Selected approach 

Based on the discussed criteria, the following
implementation details are used in the voltage tuning
framework:

Silicon speed will be profiled and a new 
voltage applied before main application 
execution. 

This avoids any performance or fine-grained timing
issues by not introducing any extra processing
during execution of the main application. The
analysis time and power supply slew rate become
decoupled from the constraints of the program.
However, the effect upon start-up time may be
undesirable for applications requiring a very rapid
cold-start. It may also fail to account for environmental
changes, such as dramatic ambient
temperature variation.

A self-exercising routine will heat the 
processor before tuning. 

In order to ensure the speed of the device is measured
appropriately, a high-power test loop will
be executed for a period of time before the speed
profiling phase and will continue throughout it. This
heats up the silicon prior to testing and keeps it
warm during testing. This approach assumes that the
chip package and circuit board’s heat dissipation,
as well as the environment and processor workload,
do not make for a gradual, unabated rise in
operating temperature over a longer time period.

Figure 2: Depiction of voltage tuning process,
including warm-up and RO sampling, followed
by normal application execution.
(Diagram: XS1 processor, up to 8 threads,
against a time axis from T0 to T5+.)

The framework is implemented as a series of
libraries that provide control over the core voltage,
routines for heating up the processor and a process
for measuring the silicon speed and selecting an
appropriate voltage for the given frequency. The
implementation is outlined in Figure 2 and an
explanation follows.

The simplest invocation of the framework is to
request that it set the core voltage to the lowest
safe level for the given clock speed. When doing
this, the framework first determines the frequency
by reading the PLL configuration and core clock
divider in combination with a compile time macro
that specifies the oscillator frequency. This assumes,
therefore, that the oscillator frequency
is correctly specified by the hardware designer
and/or software developer.

Next, the processor is heated for a period of
time, which, at its 65 nm feature size, will slow
the ROs [KK06]. This aims to reflect the silicon
speed under a heavy workload. This is achieved by
executing several threads of interleaved multiplication
operations with specially selected operand
values. This particular set of threads has been
shown to maximise the power dissipation of the
core [KE15].

After a warm-up period of one second, the values
of the counters connected to the ROs are
recorded, then the counters are enabled, incrementing
at a rate governed by the RO frequency. In
testing, one second was found to be sufficient heating
time to produce the observable RO slowdown.
After 85 µs, the counters are stopped and re-read, and
the difference is calculated. This measurement duration
captures a good sample of the RO frequency,
without overrunning the 16-bit counters. This
measurement step can be performed several times
to establish an average. The thread responsible
for configuration and sampling of the ROs is inactive
the majority of the time, leaving the warm-up
threads to fully occupy the pipeline. Once sufficient
samples are collected, the slowest of the
device’s ROs is then taken as an approximation
of the silicon speed.
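The averaging and selection step can be sketched in Python. The sketch assumes each sample is a counter difference taken over the 85 µs window described above, and that the RO names used as dictionary keys are hypothetical labels rather than identifiers from the XS1 documentation. It also shows the highest RO frequency that a 16-bit counter can represent in that window without overrunning.

```python
from statistics import mean

WINDOW_S = 85e-6  # measurement window from the text, in seconds


def max_countable_hz(window_s=WINDOW_S, bits=16):
    """Highest RO frequency measurable before the counter overruns."""
    return ((1 << bits) - 1) / window_s


def slowest_ro(samples_by_ro, window_s=WINDOW_S):
    """Average each RO's counter deltas and return the slowest one.

    samples_by_ro maps a label to a list of counter differences, one
    per measurement. Returns (label, average frequency in Hz).
    """
    averages = {
        name: mean(deltas) / window_s
        for name, deltas in samples_by_ro.items()
    }
    name = min(averages, key=averages.get)
    return name, averages[name]


# Hypothetical counter deltas for two ROs, two samples each.
samples = {"core_ro": [26000, 26010], "io_ro": [25500, 25490]}
name, freq_hz = slowest_ro(samples)
print(name, round(freq_hz / 1e6, 1), "MHz")
print(round(max_countable_hz() / 1e6), "MHz ceiling")
```

Taking the slowest RO rather than the mean across ROs is the conservative choice, since the tuned voltage must be safe for the slowest region of the silicon that was observed.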

Two RO scaling characteristics, $S_f$ and $S_v$,
must be determined, with the aim of providing
an analogue for Equation 3 to give a target operating
voltage and/or frequency (equivalent to
determining appropriate $F_{norm}$ and $V_{norm}$). $S_f$
is the ratio between RO frequency, $F_o$, and processor
frequency, $F_p$, such that $F_o$ indicates that
the silicon is operating quickly enough to meet
the timing requirements of the processor at $F_p$.
This yields the inequality in Equation 5, which
states the minimum $F_o$ for a given target processor
frequency. The second characteristic, $S_v$, is
the ratio between core voltage and $F_o$, satisfying
Equation 6.

$F_o \geq \frac{F_p}{S_f}$    (5)

$V \geq F_o \cdot S_v$    (6)

If $F_o$ is the current RO frequency and $V$ is
the current core voltage, then a new target RO
frequency, $F_o'$, may be found that still satisfies
Equation 5, and similarly a new voltage, $V'$, that
can provide $F_o'$, per Equation 6.

In the above example it is assumed that the
processor is operating safely and that a voltage
optimisation is taking place. It is also possible to
calculate a higher $V'$ for a higher target processor
frequency, $F_p$, using the same method. In either
case, $V'$ is calculated using Equation 7.

$V' = V + S_v \cdot (F_o' - F_o)$    (7)

The framework’s interface to the power supply
can then perform the transition to $V'$ within
a safe slew rate, after which the tuning process
is complete and the application can start. This
process is constrained within maximum and minimum
supported voltages and frequencies, based
upon the power supply capabilities, recommended
limits and the operating ranges that were used in
order to generate safe values for $S_f$ and $S_v$.
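The voltage selection can be sketched end to end. The Python below assumes the forms F_o' = F_p / S_f for the minimum target RO frequency and V' = V + S_v (F_o' - F_o) for the new voltage, then rounds up to the supply resolution and clamps to the supply range. The scaling values 1.7 and 5.95e-06 are the paper's characterised terms; the frequency units are not stated in the text, so kHz is an assumption made here, and the measured RO frequencies are hypothetical.

```python
import math

S_F = 1.7        # RO-to-core frequency scaling term
S_V = 5.95e-06   # volts per unit of RO frequency (kHz assumed here)


def target_voltage_mv(v_now_mv, f_ro_now, f_core_target,
                      s_f=S_F, s_v=S_V,
                      v_min_mv=600, v_max_mv=1300, step_mv=10):
    """Compute a tuned core voltage in millivolts.

    f_ro_now and f_core_target must share the unit that s_v was
    characterised in (assumed kHz). The result is rounded up to the
    next supply step, erring towards a higher voltage, then clamped.
    """
    f_ro_target = f_core_target / s_f                 # minimum RO frequency
    delta_v = s_v * (f_ro_target - f_ro_now)          # volts, may be negative
    v_new_mv = v_now_mv + 1000.0 * delta_v
    v_new_mv = math.ceil(v_new_mv / step_mv) * step_mv
    return min(max(v_new_mv, v_min_mv), v_max_mv)


# Hypothetical RO readings (kHz) at 1.00 V for a 500 MHz core target.
print(target_voltage_mv(1000, 306_000, 500_000))  # slower sample
print(target_voltage_mv(1000, 318_000, 500_000))  # faster sample
```

A faster RO reading leaves more headroom above the minimum target, so the computed voltage is lower. Because a formula is evaluated rather than a table consulted, any target core frequency within the characterised range can be used.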

3.4 Characterisation 

Prior to testing this implementation, the RO characteristics
$S_f$ and $S_v$ are determined through empirical
measurement of an XMOS XS1-U8A-64
processor, shown in Table 1.

The RO frequency in relation to voltage is
recorded, along with a conservative $S_f$, the result
of a series of frequency and voltage tests in which a
stress-test application was run to verify correct operation
of the hardware at each frequency/voltage
combination. The stress-test exercises multiple
components of the processor simultaneously. It
has three possible outcomes: success, where the
test completes without error; failure, where an
error is detected during execution; or crash, where
the system becomes unresponsive. The test is not
considered a certification of reliability (it comes
with no guarantee from the vendor), but is sufficient
for experimental purposes; if this test does
not exhibit transient faults, none are seen in regular
applications on the same test bed.
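One way such a characterisation sweep could be organised is sketched below. This is an assumption about the procedure, not a description of the authors' test harness: the `run_test` callable stands in for running the stress-test at a given voltage, and the sweep stops at the first voltage that does not report success.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"   # test completed without error
    FAILURE = "failure"   # an error was detected during execution
    CRASH = "crash"       # the system became unresponsive


def lowest_passing_mv(run_test, start_mv=1000, floor_mv=600, step_mv=10):
    """Step the voltage down until the stress-test stops succeeding.

    run_test is a hypothetical callable taking a voltage in millivolts
    and returning an Outcome. Returns the lowest voltage that passed,
    or None if even the starting voltage did not pass.
    """
    lowest = None
    for mv in range(start_mv, floor_mv - 1, -step_mv):
        if run_test(mv) is not Outcome.SUCCESS:
            break
        lowest = mv
    return lowest


def fake_test(mv):
    """Stand-in for real hardware: passes at or above 900 mV."""
    if mv >= 900:
        return Outcome.SUCCESS
    return Outcome.FAILURE if mv >= 870 else Outcome.CRASH


print(lowest_passing_mv(fake_test))
```

Treating failure and crash identically keeps the sweep conservative; a margin above the lowest passing voltage would normally be added before deriving scaling terms from the result.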


Term    Value
$S_f$   1.7
$S_v$   5.95e-06

Table 1: The RO scaling terms for a XS1-U8A-64
processor.


4 Testing 

Following the implementation of the voltage tuning
framework, it was tested on nine XS1-U8A-64
processors, covering three each of slow, typical
and fast silicon samples. A test rig, capable of
reading the power supply voltages, was used in
order to accurately observe the changes in voltage
that were applied. For each test run, the tuning
process was performed and the target voltage set
as per the description in Section 3.3.

To verify system stability at the tuned voltage,
a stress-test application is used, as described in
Section 3.4. Prior to a full suite of tests, the $S_f$
and $S_v$ parameters were tested on slow silicon to
confirm the parameters were chosen correctly to
avoid failures or crashes on all samples of the chip.
The tuning framework and stress-test were run three
times on each of the nine processors and average
values collected. It is worth noting, however, that
there was negligible variation between test runs
on any given sample processor. Tests were conducted
at 500 MHz and 400 MHz to demonstrate
the capability to tune depending on the required
system performance.

Figure 3 shows a box-plot of the reduction in
static and dynamic processor power for each of
the sampled chips at 500 MHz and 400 MHz,
compared to the nominal operating voltage of
1 V. There is no voltage/frequency table specific
to this processor, so this is the typical operating
point, per the datasheet. The change in static
and dynamic power is determined by evaluating
the voltage terms in Equations 1 and 2 whilst the
other terms remain unchanged. The figure also
shows the kernel density plot of the data beneath
the box-plots, forming a violin plot and projecting
the behaviour for a larger sample set.

At 500 MHz, the slowest processor of the sample 
set benefits from a 70 mV reduction in the core 
supply, yielding a dynamic power reduction to 
0.86 of the default and a static power reduction 
to 0.92 of its original state. The fastest processor 
gets a 140 mV reduction, lowering dynamic power 
and static power to 0.75 and 0.87 of their prior 
levels, respectively. 

At 400 MHz, the power savings are greater and
the distributions more spread out, but follow
the same shape as for 500 MHz, in line with the
characteristics of the sampled chips. An accompanying
figure projects the $F_p$ and $V$ combinations across a
wider range, demonstrating the range of voltages
that would be applied to different chip samples
for a particular $F_p$.

This data demonstrates that RO tuning can
save system energy in all cases, but most importantly,
can save more energy in processors where
the silicon is fast enough to allow it. The achieved
dynamic power saving across the sampled chips
varies by 14% at the maximum operating frequency
of 500 MHz, passing the critical path test
application in all cases.

Figure 3: Violin plot of static and dynamic power
savings for RO voltage tuning at 500
and 400 MHz across a sample of nine
XS1-U8A-64 chips. The top whiskers
represent the saving in the slowest silicon
under test and the bottom whiskers
represent the fastest sample.

5 Evaluation

Our RO-based voltage tuning has been shown to
be effective at reducing device energy consumption
with zero impact on program performance,
save for a slower start-up time. The method is
able to save over 25% of power in the core supply
for suitable samples of silicon and achieve power
saving in all samples. The actual power saving
will depend on the behaviour of the application,
but in a typical scenario this may reduce the
power dissipation from approximately 150 mW
to 110 mW, when considering the power profile
of the XS1-U8A-64.

A conservative and linear relationship between
RO frequency, target core frequency and lowest
safe voltage is used. As such, the energy saving
at a given frequency is not the absolute minimum,
nor does the energy saving exactly fit the curve
of the silicon’s performance as voltage and temperature
change. Specifically targeted hardware
solutions are therefore able to better characterise
the critical path and provide tighter voltage tuning.
However, this approach is still able to provide
reliable operation whilst saving energy, provided
there are no severe variations in the operating
conditions.

The strength of this approach is in its use of
general purpose hardware for characterisation and
software for control. This creates a highly flexible
voltage tuning implementation that can be easily
mapped to other similarly equipped devices,
does not interfere with the real-time behaviour of
the running application, and is unconstrained in
voltage/frequency selection except for any limitations
imposed by the hardware.
Figure 4: Projection of safe voltage/frequency
combinations within 0.6-1.0 V for the
fastest and slowest silicon samples
tested.
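
The conservative linear relationship between RO frequency, target core frequency and lowest safe voltage can be sketched as follows. This is a minimal illustration and not the authors' implementation: the exact form of the relationship, the meaning given here to the parameters `s_f` and `s_v`, and the numeric values in the usage line are all assumptions. Only the 1.0 V nominal supply and the 0.6 V lower bound are taken from the text.

```python
def safe_voltage(ro_freq_mhz, core_freq_mhz, s_f, s_v,
                 v_nominal=1.0, v_min=0.6):
    """Estimate a conservative core voltage from an observed RO frequency.

    Hypothetical linear form: the RO speed needed to support the target
    core frequency is taken as s_f * core_freq_mhz. Any surplus RO speed
    above that is traded for a voltage reduction of s_v volts per MHz.
    """
    surplus_mhz = ro_freq_mhz - s_f * core_freq_mhz
    if surplus_mhz <= 0:
        # Silicon is no faster than required: stay at the nominal voltage.
        return v_nominal
    v = v_nominal - s_v * surplus_mhz
    # Never go below the lowest supported voltage or above nominal.
    return max(v_min, min(v_nominal, v))


# Illustrative parameter values only; real values would come from
# characterising a sample of devices, as described in the paper.
print(safe_voltage(ro_freq_mhz=300.0, core_freq_mhz=500.0,
                   s_f=0.5, s_v=0.002))
```

Because the model is linear and clamped, a faster RO reading or a lower target core frequency both yield a lower voltage, which matches the qualitative behaviour projected in Figure 4.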


6 Future work 

Areas of future work include more sophisticated 
software implementations for the control loop, 
integration with different hardware critical path 
estimation methods, and the application of this 
work to other architectures. This section discusses 
all of these areas in turn. 

Software DVFS 

The current software implementation is run before
program startup, minimising integration effort
and guaranteeing no disruption to program
execution. There is scope for applying this technique
in a periodic manner, continuing to sample
RO speed throughout program execution in order
to adapt to environmental changes, or changes in
the workload of the processor that might create
more or less heat. On the current target architecture,
this could be implemented as a dedicated
thread, provided the instrumented application’s
resource requirements (with respect to number of
threads and performance) would not be adversely
affected. This would allow our approach to be
used as a continuous controller, in a comparable
manner to full hardware implementations such
as [BPSB00] [DRM+11].
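
A periodic variant of the tuning step might look like the sketch below. It is written in Python for illustration rather than for the target device, and the callables `read_ro_mhz`, `estimate_voltage` and `set_voltage` are hypothetical stand-ins for the RO read-out, the voltage estimation and the supply control. The hysteresis threshold `step_v` is likewise an assumption.

```python
import threading


def run_periodic_tuner(read_ro_mhz, estimate_voltage, set_voltage,
                       period_s, stop_event, step_v=0.01):
    """Re-tune the supply every period_s seconds until stop_event is set.

    Increases in the target voltage are applied immediately, since running
    below the safe voltage risks failure. Decreases are applied only when
    they exceed step_v, to avoid constant small adjustments.
    """
    current_v = None
    while not stop_event.is_set():
        target_v = estimate_voltage(read_ro_mhz())
        if (current_v is None
                or target_v > current_v
                or current_v - target_v >= step_v):
            set_voltage(target_v)
            current_v = target_v
        # Sleep for one period, waking early if asked to stop.
        stop_event.wait(period_s)
    return current_v
```

The function could be run in its own thread with `threading.Thread(target=run_periodic_tuner, args=(...))` and stopped by setting the event, mirroring the dedicated-thread arrangement suggested above.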

In some applications, it may be beneficial to apply
the constraints in reverse. For example, in an
energy-scarce environment, a maximum voltage
may be available, and so the control loop should
tune frequency appropriately, maximising performance
with the available voltage supply. This
is a relatively straightforward task in terms of
engineering the control framework, although the
impact upon the performance of relevant applications
would need to be studied.
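
Applying the constraints in reverse can be sketched as a search over candidate core frequencies. The example below is an assumption-laden illustration: `min_safe_voltage` is a hypothetical callable that returns the estimated minimum safe voltage for a given RO reading and core frequency, and the linear model and values used in the usage lines are invented for demonstration. It also treats the RO reading as fixed, whereas in practice the RO frequency itself changes with the supply voltage.

```python
def max_core_frequency(ro_freq_mhz, v_available, candidate_freqs_mhz,
                       min_safe_voltage):
    """Return the highest candidate core frequency whose estimated minimum
    safe voltage does not exceed v_available, or None if none qualifies."""
    feasible = [f for f in candidate_freqs_mhz
                if min_safe_voltage(ro_freq_mhz, f) <= v_available]
    return max(feasible) if feasible else None


def example_model(ro_mhz, core_mhz):
    """Illustrative linear voltage model with made-up parameters."""
    return 1.0 - 0.002 * (ro_mhz - 0.5 * core_mhz)


print(max_core_frequency(300.0, 0.85, [300.0, 400.0, 500.0], example_model))
```

Restricting the search to a list of candidate frequencies keeps the sketch simple and reflects that core clock settings are typically chosen from a discrete set.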

Critical path characterisation 

One of the key contributions of this paper is the
use of a software controlled multi-purpose hardware
block, rather than a dedicated hardware
block designed for critical path representation.
However, the software control loop could be integrated
with an appropriately instrumented critical
path representation such as the UDL [INS+12],
creating a more accurate control loop that is still
software driven.

In addition, techniques that monitor timing
slack could be applied, although sufficiently
fine-grained and accurate voltage samples may
be impractical in a hardware-software control loop.

Other devices, direct comparison 

Wider comparisons could be drawn by applying
this technique to a range of architectures, starting
by identifying those with similarly controllable
RO hardware. Of particular interest are FPGAs,
with which work such as [NNY12] [NY13] could be
directly compared to our method, by instrumenting
a design with each of the forms of sensing and
control.

7 Conclusions 

This paper has presented a technique for using
multi-purpose ring oscillators to provide a characterisation
of device speed, accounting for process
variation across samples of a device and environmental
factors such as temperature. This characterisation
is utilised by a software control loop,
which tunes the operating voltage of the device
to a safe minimum at the required core clock frequency.
The result is a variation-sensitive power
optimisation method that uses a simple hardware
block and flexible software controller that can
provide significant power savings.

The characterisation and control method is
demonstrated on a device with software-accessible
ring oscillators, the XMOS XS1-U8A-64 embedded
microprocessor. In testing on a sample set
of this processor, the method saves between 14%
and 25% of dynamic power, demonstrating sensitivity
to silicon speed and saving power in all
test cases.

The control method has zero impact on the performance
of any application run on the processor
as the optimisation is performed at start-up. The
software implementation of this control method
is available upon request to the authors. Further
work has been proposed that would allow
testing of this technique with other devices to
allow closer comparison with other voltage tuning
methods. In addition, the flexibility of the software
control implementation could be leveraged
to provide similar voltage tuning optimisations
using other types of characterisation hardware or
as a continuously operating control loop.